\section{Additional Benchmark Details}\label{sec:apdx_data}

%\subsection{Dataset Construction}

\paragraph{QNLI} 
To construct a balanced dataset, we select all pairs in which the most similar sentence to the question was \emph{not} the answer sentence, as well as an equal number of cases in which the correct sentence was the most similar to the question but another, distracting sentence was a close second. Our similarity metric is based on CBoW representations built from pre-trained GloVe embeddings.
This approach to converting pre-existing datasets into NLI format is closely related to recent work by \citet{white2017inference}, as well as to the original motivation for textual entailment presented by \citet{dagan2006pascal}. Both argue that many NLP tasks can be productively reduced to textual entailment. 
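
For concreteness, the following minimal Python sketch illustrates the selection heuristic. It assumes cosine similarity over averaged GloVe vectors (the text above does not fix the exact distance function) and introduces an illustrative \texttt{margin} threshold to operationalize ``close second''; the function names are hypothetical and not drawn from the original implementation.

\begin{verbatim}
import numpy as np

def cbow_embedding(sentence, glove, dim=300):
    # CBoW: average the pre-trained GloVe vectors of in-vocabulary tokens.
    vecs = [glove[w] for w in sentence.lower().split() if w in glove]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def classify_pair(question, sentences, answer_idx, glove, margin=0.05):
    # Rank the paragraph's sentences by similarity to the question and
    # decide which selection bucket, if any, the example falls into.
    # `margin` is an assumed threshold for "a close second".
    q = cbow_embedding(question, glove)
    sims = [cosine(q, cbow_embedding(s, glove)) for s in sentences]
    ranked = sorted(range(len(sentences)), key=lambda i: -sims[i])
    if ranked[0] != answer_idx:
        return "distractor_most_similar"  # kept: answer not top-ranked
    if len(sims) > 1 and sims[ranked[0]] - sims[ranked[1]] < margin:
        return "close_second"             # kept: a distractor nearly ties
    return None                           # discarded
\end{verbatim}

A balanced dataset then keeps every example from the first bucket and an equal number sampled from the second.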
